
[KVCache] Support only flush FD GPU Cache index by AttentionStore #7609

Merged
Jiang-Jia-Jun merged 1 commit into PaddlePaddle:develop from jackyYang6:kvcache/as_only_flush on Apr 28, 2026

Conversation

jackyYang6 (Contributor) commented Apr 24, 2026

Motivation

This PR improves the FD_AS_ONLY_FLUSH flow for AttentionStore so FastDeploy can flush KV cache index state when GPU cache blocks are evicted, especially in pure-GPU cache deployments without CPU cache.

It adds the required flush metadata to support more accurate AttentionStore index updates for GPU eviction.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

  • Extend WriteStorageTask with:
    • flush_cache_exists to indicate whether cache still exists on the current node in FD_AS_ONLY_FLUSH mode.
    • start_write_block_idx to support partial flush/write from a specified block index.
  • Update cache_transfer_manager AS-only flush path to call AttentionStore.flush_token_index(...) with both start_write_block_idx and reside_in_gpu.
  • Propagate FD_AS_ONLY_FLUSH to the cache transfer manager subprocess.
  • Update prefix_cache_manager.free_block_ids_async(...) to emit flush-only tasks when GPU cache blocks are directly evicted in FD_AS_ONLY_FLUSH + attention_store mode.
  • Add FD_AS_ONLY_FLUSH environment variable entry in fastdeploy/envs.py.
  • Add unit test coverage for GPU eviction flush behavior, including:
    • flush_cache_exists=False
    • empty gpu_block_ids in flush-only mode
    • correct start_write_block_idx=depth-1
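As a sketch of how the extended task metadata might fit together (only the two new fields, flush_cache_exists and start_write_block_idx, come from this PR's description; the dataclass shape, other fields, and defaults are assumptions, not the actual FastDeploy definition):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WriteStorageTask:
    # Hypothetical sketch of the extended task. Only flush_cache_exists and
    # start_write_block_idx are new fields described in this PR.
    task_id: str
    gpu_block_ids: List[int] = field(default_factory=list)
    # True while the cache still exists on the current node (FD_AS_ONLY_FLUSH mode);
    # False when GPU blocks were directly evicted.
    flush_cache_exists: bool = True
    # Index of the first block to flush/write, enabling partial flushes.
    start_write_block_idx: int = 0


# A flush-only task for a GPU eviction at depth 3: start from block index 2.
task = WriteStorageTask(
    task_id="t1",
    gpu_block_ids=[],
    flush_cache_exists=False,
    start_write_block_idx=2,
)
```

This mirrors the unit-test scenarios listed above: flush_cache_exists=False, empty gpu_block_ids, and start_write_block_idx=depth-1.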

Usage or Command

For FD_AS_ONLY_FLUSH mode with AttentionStore:

export FD_AS_ONLY_FLUSH=1

Reference test command:

python3 -m pytest tests/cache_manager/test_prefix_cache_manager.py -q

Accuracy Tests

N/A. This PR does not change model forward results or kernel numerical behavior. It only updates KV cache index flush metadata and adds unit tests for cache manager behavior.

Checklist

  • Add at least a tag in the PR title.
    • Suggested title: [KVCache] Support flush FD GPU/CPU Cache index by AttentionStore
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag. (N/A for current develop PR)

paddle-bot (Bot) commented Apr 24, 2026

Thanks for your contribution!


codecov-commenter commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 51.51515% with 16 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@4c8f7df). Learn more about missing BASE report.

Files with missing lines                             Patch %   Lines
fastdeploy/cache_manager/cache_transfer_manager.py   17.64%    13 Missing and 1 partial ⚠️
fastdeploy/cache_manager/prefix_cache_manager.py     84.61%    0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7609   +/-   ##
==========================================
  Coverage           ?   71.66%           
==========================================
  Files              ?      419           
  Lines              ?    57885           
  Branches           ?     9085           
==========================================
  Hits               ?    41485           
  Misses             ?    13569           
  Partials           ?     2831           
Flag Coverage Δ
GPU 71.66% <51.51%> (?)


jackyYang6 changed the title from "[KVCache] Support only flush FD GPU/CPU Cache index by AttentionStore" to "[KVCache] Support only flush FD GPU Cache index by AttentionStore" on Apr 27, 2026
jackyYang6 force-pushed the kvcache/as_only_flush branch from 22dab4e to 429ed50 on April 27, 2026 16:20

PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-04-28 00:33:47

📋 Review Summary

PR overview: In FD_AS_ONLY_FLUSH mode, when GPU cache blocks are evicted, flush the KV cache index via AttentionStore, skipping the actual data write and updating only index state.

Scope of changes: fastdeploy/cache_manager/ (cache_tasks, cache_transfer_manager, prefix_cache_manager), fastdeploy/envs.py, tests/cache_manager/

Impact tag: [KVCache]


📝 PR Convention Check

The title [KVCache] Support only flush FD GPU Cache index by AttentionStore carries a valid official tag, and the description is complete (Motivation / Modifications / Usage / Accuracy Tests / Checklist are all filled in). Overall compliant. ✓


Issues

Level          File                                                     Summary
🟡 Suggestion  fastdeploy/cache_manager/cache_transfer_manager.py:989   The FD_AS_ONLY_FLUSH check in write_back_storage_task lacks a backend type guard; on non-attention_store backends all write operations are silently skipped
❓ Question    fastdeploy/cache_manager/prefix_cache_manager.py:1452    hash_value_flush_info keeps only the min_depth node's token_ids; confirm their length can cover deeper evicted blocks
❓ Question    fastdeploy/cache_manager/prefix_cache_manager.py:1251    With is_sync=False, the _flush_only_storage_task subprocess still sends put_transfer_done_signal; confirm there is no risk of orphan signals accumulating on the consumer side

Overall Assessment

The overall design is clear, the AS-only flush path is implemented sensibly, and the unit tests cover the core eviction scenarios. The main risk to watch is the missing backend type guard in write_back_storage_task, plus two edge cases: token_ids coverage and orphan signals. Recommend merging after the author confirms these points.

self.storage_backend
), f"storage_backend not initialized, storage_backend_type: {self.storage_backend_type}"

if envs.FD_AS_ONLY_FLUSH:

🟡 Suggestion: the FD_AS_ONLY_FLUSH check in write_back_storage_task lacks a storage_backend_type filter

The current code unconditionally early-returns into _flush_only_storage_task when FD_AS_ONLY_FLUSH=True, but that function only performs the real flush when storage_backend_type == "attention_store"; under any other backend the whole try block is a no-op. This means that if a user mistakenly sets FD_AS_ONLY_FLUSH=1 with a non-attention_store backend, all write operations are silently skipped and the cache is permanently lost with no error.

Suggest adding a backend type check here, or raising explicitly in _flush_only_storage_task for non-attention_store backends:

if envs.FD_AS_ONLY_FLUSH:
    if self.storage_backend_type != "attention_store":
        raise ValueError(
            f"FD_AS_ONLY_FLUSH is only supported with attention_store backend, "
            f"but got: {self.storage_backend_type}"
        )
    return self._flush_only_storage_task(task)

self.gpu_lru_leaf_set.remove(node)
if self.cache_config.num_cpu_blocks < need_block_num:
if node.shared_count == 0 and node.is_gpu_leaf_node:  # reclaim directly
if envs.FD_AS_ONLY_FLUSH and self.kvcache_storage_backend == "attention_store":

❓ Question: hash_value_flush_info keeps only the min_depth node, so token_ids come from the shallowest node

When multiple nodes at different depths under the same input_hash_value are evicted in one batch (e.g. depths 2, 3, and 4 all hit), only the min_depth node's token_ids are kept, and a single flush task is sent with start_write_block_idx = min_depth - 1.

I have checked the attention_store.flush_token_index implementation and confirmed its semantics: all block states from start_block_idx to the end are updated, so one flush can cover the full range from the shallowest node down to the leaf node. This logic is correct.

One edge case still needs the author's confirmation, though: do the shallowest node's input_ids (token_ids) contain a long enough sequence for the SDK to locate the deeper blocks? If input_ids only encode up to the block at min_depth, the SDK may not be able to cover the deeper evicted range.
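The min_depth grouping discussed here can be sketched as follows (a hypothetical illustration of the described behavior, not the actual prefix_cache_manager code; the function name and node tuple shape are assumptions):

```python
def group_flush_info(evicted_nodes):
    """Keep, per input_hash_value, only the shallowest evicted node.

    Hypothetical sketch of the grouping described in the review: one flush
    task per hash value, with start_write_block_idx = min_depth - 1, relying
    on flush_token_index updating every block from that index to the end.
    Each node is an (input_hash_value, depth, token_ids) tuple.
    """
    shallowest = {}
    for hash_value, depth, token_ids in evicted_nodes:
        kept = shallowest.get(hash_value)
        if kept is None or depth < kept[0]:
            shallowest[hash_value] = (depth, token_ids)
    # Emit one flush task per hash value, anchored at the shallowest depth.
    return {
        h: {"start_write_block_idx": depth - 1, "token_ids": token_ids}
        for h, (depth, token_ids) in shallowest.items()
    }


# Nodes at depths 2, 3, and 4 under the same hash collapse into one task
# starting at block index 1 (min_depth - 1), keeping the shallowest token_ids.
tasks = group_flush_info(
    [("h1", 3, [1, 2, 3]), ("h1", 2, [1, 2]), ("h1", 4, [1, 2, 3, 4])]
)
```

The open question above is visible in the sketch: the kept token_ids ([1, 2]) are shorter than the deepest node's, which is exactly the coverage concern raised for the SDK.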

raise ValueError(err_msg)

self.task_write_back_event[task.task_id] = Event()
if is_sync:

❓ Question: with is_sync=False no Event is created, but the _flush_only_storage_task subprocess still calls put_transfer_done_signal

free_block_ids_async triggers the flush via issue_write_back_storage_task(flush_task, is_sync=False); the main process neither creates a task_write_back_event nor waits for completion. But after execution, the subprocess's _flush_only_storage_task still calls put_transfer_done_signal(result); on the main-process side this signal has no corresponding Event receiver and is silently ignored.

When GPU evictions are frequent (e.g. mass eviction after many prefix hits), a large number of orphan signals could accumulate. Please confirm that the consumer side of put_transfer_done_signal is fully safe when no matching task_id is found (no memory leak, no deadlock).
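One way a consumer side can stay safe against such orphan signals is to pop the Event under a lock and drop any signal with no registered waiter (a hypothetical sketch assuming an in-process Event registry; names mirror the discussion, not the actual FastDeploy implementation):

```python
import threading


class TransferDoneConsumer:
    """Hypothetical sketch: drop transfer-done signals that have no Event."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = {}  # task_id -> threading.Event (sync tasks only)

    def register(self, task_id):
        # Only is_sync=True callers register an Event and wait on it.
        with self._lock:
            ev = threading.Event()
            self._events[task_id] = ev
            return ev

    def on_transfer_done(self, task_id):
        # Pop under the lock: a signal for an unknown task_id (e.g. an
        # is_sync=False flush) is dropped immediately, so nothing accumulates
        # and no lock is held while waking the waiter.
        with self._lock:
            ev = self._events.pop(task_id, None)
        if ev is not None:
            ev.set()
        return ev is not None


consumer = TransferDoneConsumer()
ev = consumer.register("sync-task")
handled = consumer.on_transfer_done("sync-task")   # signal matched, event set
orphan = consumer.on_transfer_done("async-flush")  # orphan signal, dropped
```

Popping (rather than reading) the entry also guarantees the registry cannot grow without bound from completed sync tasks.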

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit d92cad9 into PaddlePaddle:develop Apr 28, 2026
36 of 40 checks passed

4 participants